引用视频对象细分旨在预测视频中自然语言表达式引用的对象的前景标签。先前的方法要么取决于3D convnet,要么将附加的2D转向器作为编码器,以提取混合时空特征。但是,由于在解码阶段发生的延迟和隐式时空相互作用,这些方法遭受了空间错位或虚假分散因素的影响。为了解决这些限制,我们提出了一个语言桥梁的双链传输(LBDT)模块,该模块将语言用作中间桥,以在编码阶段早期完成显式和适应性的时空交互。具体地,在时间编码器中进行了交叉模式的注意,将单词和空间编码器引用以汇总和传递与语言相关的运动和外观信息。此外,我们还提出了在解码阶段的双边通道激活(BCA)模块,以通过通道激活进一步降低并突出时空一致的特征。广泛的实验表明,我们的方法在四个流行的基准测试基准上获得了新的最新性能,分别在A2D句子和J-HMDB句子上获得了6.8%和6.9%的绝对AP收益,同时消耗了大约7倍的计算机开销。
translated by 谷歌翻译
对比学习一直吸引着学习无监督的句子嵌入。当前的最新无监督方法是无监督的SIMCSE(UNSUP-SIMCSE)。 Unsup-Simcse将辍学作为最小数据增强方法,并将相同的输入句子传递给预训练的变压器编码器(带有掉落的掉落)两次,以获取两个相应的嵌入式以构建正对。由于句子的长度信息通常会由于使用嵌入变压器中的位置嵌入而编码到句子嵌入中,因此Unsup-Simcse中的每个正对实际上包含相同的长度信息。因此,接受这些正面对训练的Unsup-Simcse可能是有偏见的,这往往会考虑到语义上相同长度或相似长度的句子更相似。通过统计观察,我们发现Unsup-Simcse确实存在这样的问题。为了减轻它,我们应用了一个简单的重复操作来修改输入句子,然后分别将输入句子及其修改后的对应物传递给预训练的变压器编码器,以获取阳性对。此外,我们从计算机视觉社区中汲取灵感,并引入动量对比度,从而扩大了负面对的数量,而没有其他计算。提出的两种修改分别应用于正和负对,并构建一种新的句子嵌入方法,称为增强的Unsup-Simcse(ESIMCSE)。我们在几个基准数据集W.R.T上评估了所提出的ESIMCSE,语义文本相似性(STS)任务。实验结果表明,ESIMCSE的表现优于最先进的undup-Simcse,而Bert基碱的平均长矛相关性为2.02%。
translated by 谷歌翻译
对比度学习已逐渐应用于学习高质量的无监督句子嵌入。据我们所知,在以前的无监督方法中,最新的最新方法是无监督的SIMCSE(Unsup-Simcse)。 Unsup-Simcse在训练阶段使用Infonce1Loss功能,通过将语义上相似的句子拉在一起并分开不相似。从理论上讲,我们希望在Unsup-Simcse中使用较大的批次,以在样本中进行更充分的比较并避免过度拟合。但是,增加批量的大小并不总是会导致改进,而是在批处理大小超过阈值时会导致性能降解。通过统计观察,我们发现这可能是由于在批量生产大小后引入了低信心负对。为了减轻这个问题,我们在Infonce损失函数上引入了一种简单的平滑策略,称为Gaussian平滑infonce(GS-Infonce)。特别是,我们将随机的高斯噪声向量添加为负样品,它们的负面样品空间的平滑性。简单,提出的平滑策略为Unsup-Simcse带来了重大改进。我们评估GS-INFONCEON标准语义文本相似性(STS)任务。 GS-Infonce的平均长矛人相关性优于最先进的Unsup-Simcse,在Bert-Base,Bert-Large,Roberta-Base的基础上,长矛人的相关性为1.38%,0.72%,1.17%和0.28%和罗伯塔·洛尔格(Roberta-Large)。
translated by 谷歌翻译
文档级别的情感分析(DSA)由于含糊的语义链接并使情感信息复杂化,因此更具挑战性。最近的工作专门用于利用文本摘要,并取得了令人鼓舞的结果。但是,这些基于摘要的方法没有充分利用摘要,包括忽略摘要和文档之间的固有交互。结果,他们将代表限制在文档中表达主要点,这高度表明了关键情绪。在本文中,我们研究了如何有效地产生具有明确的主题模式和情感环境的歧视性表示。提出了一个分层互动网络(HIN),以探索多个粒度的摘要和文档之间的双向交互,并学习以主题为导向的文档表示情感分类。此外,我们通过使用情感标签信息来完善HIN来学习基于情感的重新思考机制(SR),以学习更感知的文档表示。我们在三个公共数据集上广泛评估了我们提出的模型。实验结果始终证明了我们提出的模型的有效性,并表明HIN-SR优于各种最新方法。
translated by 谷歌翻译
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distilling targets, losses, input, network regularization, sequential distillation, etc, revealing that: 1) Distilling token relations is more effective than CLS token- and feature-based distillation; 2) An intermediate layer of the teacher network as target perform better than that using the last layer when the depth of the student mismatches that of the teacher; 3) Weak regularization is preferred; etc. With these findings, we achieve significant fine-tuning accuracy improvements over the scratch MIM pre-training on ImageNet-1K classification, using all the ViT-Tiny, ViT-Small, and ViT-base models, with +4.2%/+2.4%/+1.4% gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU in AE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way for developing small vision Transformer models, that is, by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
translated by 谷歌翻译
Few Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes with limited several support examples. In this work, we explore a simple yet unified solution for FSIS as well as its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully explore the relationship between support/query features based on a Transformer-like framework. Our key insights are two folds: Firstly, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Secondly, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice from two aspects, i.e., feature-level and instance-level. In particular, we firstly design a mask-based dynamic weighting module to enhance support features and then propose to link object queries for better calibration via cross-attention. After the above steps, the novel classes can be improved significantly over our strong baseline. Additionally, our new framework can be easily extended to incremental FSIS with minor modification. When benchmarking results on the COCO dataset for FSIS, gFSIS, and iFSIS settings, our method achieves a competitive performance compared to existing approaches across different shots, e.g., we boost nAP by noticeable +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few Shot Object Detection. Code and model will be available.
translated by 谷歌翻译
We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to the use of discrete tokens and requiring fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient due to the use of parallel decoding. The use of a pre-trained LLM enables fine-grained language understanding, translating to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships, pose, cardinality etc. Our 900M parameter model achieves a new SOTA on CC3M, with an FID score of 6.06. The Muse 3B parameter model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a CLIP score of 0.32. Muse also directly enables a number of image editing applications without the need to fine-tune or invert the model: inpainting, outpainting, and mask-free editing. More results are available at https://muse-model.github.io
translated by 谷歌翻译
Learning the underlying distribution of molecular graphs and generating high-fidelity samples is a fundamental research problem in drug discovery and material science. However, accurately modeling distribution and rapidly generating novel molecular graphs remain crucial and challenging goals. To accomplish these goals, we propose a novel Conditional Diffusion model based on discrete Graph Structures (CDGS) for molecular graph generation. Specifically, we construct a forward graph diffusion process on both graph structures and inherent features through stochastic differential equations (SDE) and derive discrete graph structures as the condition for reverse generative processes. We present a specialized hybrid graph noise prediction model that extracts the global context and the local node-edge dependency from intermediate graph states. We further utilize ordinary differential equation (ODE) solvers for efficient graph sampling, based on the semi-linear structure of the probability flow ODE. Experiments on diverse datasets validate the effectiveness of our framework. Particularly, the proposed method still generates high-quality molecular graphs in a limited number of steps.
translated by 谷歌翻译
Deep neural networks are vulnerable to adversarial attacks. In this paper, we take the role of investigators who want to trace the attack and identify the source, that is, the particular model which the adversarial examples are generated from. Techniques derived would aid forensic investigation of attack incidents and serve as deterrence to potential attacks. We consider the buyers-seller setting where a machine learning model is to be distributed to various buyers and each buyer receives a slightly different copy with same functionality. A malicious buyer generates adversarial examples from a particular copy $\mathcal{M}_i$ and uses them to attack other copies. From these adversarial examples, the investigator wants to identify the source $\mathcal{M}_i$. To address this problem, we propose a two-stage separate-and-trace framework. The model separation stage generates multiple copies of a model for a same classification task. This process injects unique characteristics into each copy so that adversarial examples generated have distinct and traceable features. We give a parallel structure which embeds a ``tracer'' in each copy, and a noise-sensitive training loss to achieve this goal. The tracing stage takes in adversarial examples and a few candidate models, and identifies the likely source. Based on the unique features induced by the noise-sensitive loss function, we could effectively trace the potential adversarial copy by considering the output logits from each tracer. Empirical results show that it is possible to trace the origin of the adversarial example and the mechanism can be applied to a wide range of architectures and datasets.
translated by 谷歌翻译
This paper presents a novel framework for planning in unknown and occluded urban spaces. We specifically focus on turns and intersections where occlusions significantly impact navigability. Our approach uses an inpainting model to fill in a sparse, occluded, semantic lidar point cloud and plans dynamically feasible paths for a vehicle to traverse through the open and inpainted spaces. We demonstrate our approach using a car's lidar data with real-time occlusions, and show that by inpainting occluded areas, we can plan longer paths, with more turn options compared to without inpainting; in addition, our approach more closely follows paths derived from a planner with no occlusions (called the ground truth) compared to other state of the art approaches.
translated by 谷歌翻译